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Abstract 


This is an introduction to numerical Particle-Mesh techniques, which 
are commonly used to model plasmas, gravitational N-body systems 
and both compressible and incompressible fluids. The theory behind 
this approach is presented, and its practical implementation, both for 
serial and parallel machines, is discussed. This document is based on a 
four hour lecture course presented by the author at the NASA Summer 
School for High Performance Computational Physics, held at Goddard 
Space Flight Center. 




1 Introduction 


Particle simulation techniques attempt to model many-body systems by 
solving the equations of motion of a set of particles or pseudo-particles which 
are used to represent the system. Particle-mesh(PM) techniques represent a 
popular variant on this theme, in which a numerical mesh is added to more 
effectively compute the forces acting on these model particles. 

Otherwise known as Particle-In-ceU(PIC), particle-mesh codes come in 
two basic flavors, pure particle-mesh, and a combination of particle-mesh 
and particle-particle known as P^M [1]. In this chapter we wiU only consider 
pure particle-mesh. 

The particle-mesh technique was originally invented at Los Alamos(circa 
1955 - Harlow et al) for application in compressible fluid flows [4]. It was 
essentially reinvented in the mid sixties by Buneman and by Hockney for 
application to plasmas. Since then it has been applied most often to plas- 
mas, but also to gravitational N-body systems, in solid state device design, 
compressible fluid flow, incompressible fluid flow(the vortex method), and 
MHD. 

In the fluid and MHD applications, the particles are introduced as a 
numerical artifice to add an appealing lagrangian character to the model. 
We will mention these applications only briefly. Our main focus will be 
those applications which model true particle systems. More specifically we 
will concentrate on those in which the systems particles interact with each 
other through long range forces. In what follows, unless otherwise stated, we 
shall assume that our objective is to model a system of particles interacting i 
through long range forces. 

P^M is used almost exclusively for modelling gravitational N-body sys- 
tems. 

Tracking particle trajectories enables us to explore physical effects which 
are inaccessible to other modelling techniques. For example the more com- 
mon fluid modelling approaches average over the velocity distributions of 
the particles in the system. To do this they implicitly assume a form for 
this distribution, which is invariably close to equilibrium. But much of 
the interesting physics of these systems is associated with timescales before 
these equilibrium velocity distributions can be established. To explore this 
physics we need to resolve this structure in velocity space. There are two 
techniques we might use to study structure in velocity space, a continuum 
approximation based on a Boltzmann or Vlasov type equation, or a particle 
code. Particle codes are 
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• simpler to code up and analyze than Vlasov solvers 

• more robust 

• have a lagrangian character which lends them a certain economy 

• can be applied in 2 or 3D where Vlasov solvers are generally imprac- 
tical. 


However they are generally noisier than Vlasov solvers. 

The appeal of particle codes is rather obvious. By modelling a system, 
such as a plasma using particles, we are automatically incorporating much 
of the systems structure in our model. This has many computational ad- 
vantages, Instead of working with a complex set of fluid equations, or a 
Fokker-Planck equation, we get to work with the equations of motion of 
the particles which are simple ODEs. We never have to worry about negar 
tive mass or energy densities developing. Also particle codes have a natural 
adaptivity in the sense that computational effort is automatically focussed 
where the particles accumulate. 

Of course particle codes have limitations of their own. For example in 
most cases the real physical system has many more particles than we can 
cope with on a computer. As a result the forces acting on the individual 
particles in a numerical model are much noisier^ than in the real physical 
system and we are required to take measures to limit this noise. For systems 
in which the particles interact through long-range forces, the most naive 
implementation scales in a prohibitively expensive way and we are forced 
to develop more efficient algorithms. These problems limit the range of 
physical systems to which particle methods can be successfully applied. 

The best way to understand the limitations which apply to particle- 
mesh codes is by trying to construct a code. So let us consider how we 
would construct a particle model of an electrostatic plasma. 

1.1 Particle-Particle 

Let us assume that at time t we have Np particles in our system located at 
Xi(t) with velocities u,(t), where 1 < f < Np."^ The force on particle i is 

* Noisier in the sense that statistical fluctuations become more signiflcant. We will 
explore this point in more detail later. 

^Unless otherwise stated, variables subscripted with the letters i ox j will be assumed 
to identify properties associated with particles. 
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Figure 1: Flowchart for a Particle-Particle code. 


given by 


Fi 




X{ ~ Xj 
\Xi — Xj p 


( 1 ) 


where qi is the charge on particle i. The equations of motion for particle 
with mass rui are 



dvi F I 

dt mi 

The flowchart for a simple program to model this electrostatic system might 
look like figure 1 . The force on particle i is computed by summing its interac- 
tion with every other particle. We would describe this as a particle-particle 
code. It has a severe limitation. The number of arithmetic operations re- 
quired in the force evaluation scales as . In a 3D simulation the interac- 
tion between 2 particles requires approximately 10 floating point operations. 
For a code running for Nt timesteps the force evaluation requires roughly 
10 X Nt X iVp/2 floating point operations. On one processor of a YMP C-90 
which has a clock speed of 4 nanoseconds, a 1000 timestep calculation using 
a modest 10® particles would last more than 200 days. 
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So the particle-particle approach will only allow us to model a system 
in which the physics is determined by the interaction of a small(< 10®) 
number of particles. For any other system we need to reduce the scaling of 
the operation count in the force evaluation below order Np . This is done 
in one of two ways, by using a particle-mesh technique, or by using a tree 
code. Particle-mesh techniques are most effective when the particle density 
distribution is relatively uniform. Tree codes are favored in systems with 
large density contrast. 

1.2 Particle-Mesh 

In the particle-mesh approach we replace the force equation (1) with an 
evaluation based on continuum representations of the charge density and 
electric field. Poisson’s equation 

VV = -P(x) (2) 

relates the charge density p{x) to the electric potential 4>- The electric field 
is 

E{x) = - V(^ (3) 

and the force on particle i is 

Fi - qiE(xi) (4) 

We then use finite difference approximations on a mesh to solve equations 
(2)-(4). The steps in this process are 

• calculate p at each mesh point using the particle position information. 
This operation scales as order Np. 

• solve equation (2). If we use Fast Fourier Transforms(FFT’s) this 
scales as iV^lnA^^ where Wg is the number of mesh points. 

• evaluate E at each mesh point using equation (3). This operation 
scales as Ng . 

• use interpolation and equation (4) to evaluate the force on each parti- 
cle. This scales as Np. 

Combining these steps we can see that the number of floating point opera- 
tions for the complete scheme scales as 

(XNp -|- ^NghlNg -t- ‘jNg 
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Figure 2: Flowchart for a Particle-Mesh code. 


where a, /? and 7 are constants. The flow chart for this scheme is shown in 
figure 2. 

1.3 For Which Systems Does PM Work? 

Obviously when we introduced the mesh which we shall assume is uniformly 
spaced with cell size A, we sacrificed our ability to accurately model phe- 
nomena at length scales shorter than A. 

Consider the field produced by a single charge. When we assign this 
charge to the mesh we must decide which mesh points in the particles vicinity 
acquire charge. The mesh spacing A is the length scale characterizing the 
coarseness of this assignment. In a sense the particle has acquired a finite 
size. The force field produced by this ‘finite-sized’ particle will accurately 
reproduce that of a point particle at distances from the particle which are 
large compared with A. But no matter how we assign the charge, the force 
wiU become more inaccurate as the distance from the particle decreases 
toward A. 
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Now let us pose the question, for what types of physical system are 
particle-mesh codes appropriate? 

Suppose that the nature of the system is such that the force on any par- 
ticle is dominated by the contributions from its nearest neighbors. Systems 
of this type are often called ‘collisional’ or ‘correlated’. In these cases the 
force evaluation wiU be inaccurate unless we make A small compared with 
the typical distance of closest approach. Let us make the rather conservative 
assumption that the distance of closest approach {I dose ) sts large as 1/10^^ 
of the average inter-particle spacing. Then this accuracy criterion requires 

^ ^ ^dose ^ TTq 

10 

where L = is the characteristic physical dimension of the system. 

We can rewrite this as 

For example if 

which produces a force error of < 1% at closest approach, then 

Ng > lO^iVp, 

ie. more than 10® grid cells per particle are required. Clearly particle-mesh 
is an expensive way to achieve even modestly accurate individual particle 
trajectories in a ‘collisionar system, both in terms of the necessary flops and 
storage. It should come as no surprise that there are more efficient ways to 
achieve this accuracy (eg. tree codes which scale as 

Now consider the opposite case, where close neighbors contribute very 
little to the force on a particle which is dominated by the sum of its inter- 
actions with distant particles. This type of system is called ‘coUisionless’ 
or ‘uncorrelated’. At first sight this may seem an unlikely circumstance for 
a force law which falls off as but in fact plasmas, and most N-body 
gravitational systems fit this description. In these cases the important con- 
tributions to the force are accurately represented, since from the point of 
view of any particle most of the other particles in the system will be many 
cells away. So particle-mesh should be appropriate for collisionless, not col- 
lisional, systems. 
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lA Noise Reduction 

The granularity of a particle representation inevitably introduces short-scale 
fluctuations into the force field, which are superimposed upon a more slowly 
varying component. The mean amplitude of these fluctuations is propor- 
tional to y/n^ where n is the particle number density. The ratio of the mean 
amplitudes of the fluctuations to the slowly varying component varies as 
l/y/n. In the real system these fluctuations cause particles to be scattered 
at a frequency which we call the collision frequency. Because our numerical 
model typically uses far fewer particles than are present in reality, the effect 
of these fluctuations is greatly enhanced. This produces anomalously large 
collision frequencies. 

We need to take steps to reduce the significance of these fluctuations. 
Fortunately we do not need to reduce the fluctuation amplitudes to their 
correct values, but merely to levels at which they no longer dominate the 
forces on the particles, or influence the particles significantly during the 
course of our simulation. Remember, each particle contributes charge to 
some number, G, of cells in its neighborhood. The average value of C 
depends on the number of spatial dimensions, d, in the model, and on the 
shape and size of the particles. Suppose each particle is a uniform charge 
cloud of length w in each dimension. Then for w — A in ID we get C = 2, 
in 2D we get C = 2^ = 4 and in 3D we get C = 2^ = 8. In general we have 

The granularity with which the particles represent the charge distribution 
is reduced by increasing the number of charge contributions which are made 
in each cell. This requirement is expressed in the inequality 

CNp » Ng, 

or 



We can see that there are two ways to satisfy this inequality and reduce the 
importance of the statistical fluctuations, either by increasing Np/Ng, the 
number of particles per cell, or by increasing w/A, the ratio of the particle 
size to the grid size. 
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Figure 3: Charge assignment for NGP in ID. The asterisks denote the lo- 
cation of particles, and the filled circles the mesh cell centers. The arrows 
associated with each particle indicate to which grid cell all the particles 
charge is to be assigned. The dashed vertical lines denote mesh cell bound- 
aries. 


1.5 Charge Assignment and Force Interpolation 

Once we introduce the grid we can no longer view the particles as point 
particles. We have to specify how the particle charges are assigned to the 
mesh and how the electric field at the mesh points is used to determine the 
field acting on each particle. As we noted this leads naturally to the idea of 
a finite sized particle. 

The simplest charge assignment is called “Nearest Grid Point” (NGP). 
As the name suggests this associates a particles charge with the grid point 
nearest to the particle, as shown in figure 3. The most popular scheme 
is “Cloud-m-Cell”(GIG). In this case each particle can be regarded as a 
uniform distribution of charge, of width to, centered about the particle’s 
nominal location, as shown in figure 4. Usually w is set equal to the mesh 
ceU size A. CIC is slightly more expensive computationally than NGP, but 
has better numerical properties. 
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Figure 4: Charge assignment for CIC in ID. As in the previous figure the 
asterisk denotes the particle location. The different hatched areas indicate 
the relative proportions of the particle’s charge assigned to mesh cells g and 
^ + 1 . 

1.6 N-body Theory 

Before we can understand why PM works for plasmas and some gravitational 
N-body systems, we need to understand a little about how these systems 
behave. 

Our goal in N-body theory is to understand the evolution of macroscopic 
properties of the system. We can specify the initial conditions and boundary 
conditions for the macroscopic variables which interest us. The evolution of 
these macroscopic quantities depends on the behavior of the N bodies, but 
the macroscopic quantities do not uniquely identify the micro-state of the 
system. By micro-state we mean that the state of the system is defined in 
terms of the positions and momenta of every particle. There are an infinite 
number of micro-states which are consistent with a given macro-state. 

The theory for this problem is a statistical one. It imagines an infi- 
nite number of copies(micro-states) of the system, each consistent with the 
macro-state. The description of a typical micro-state is based on an ex- 
pansion about the average of this ‘ensemble’ of micro-states, the so called 
ensemble average. The ensemble average defines a smoothed continuous 
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force field. A true micro-state has the granularity that accompanies a dis- 
tribution of discrete charges. It can be represented by the sum of the average 
force field, and a term describing fiuctuations in the force field. If is the 
force field of the micro-state, and we use ( ) to denote ensemble averaged 
quantities, then 

^)+ SF^ 

where 6F^ are the fluctuations in the micro-state. We are not interested 
in the exact form of 6F^ for any particular micro-state. We are interested 
in the statistical properties of these fluctuations in the ensemble of micro- 
states. 

Let X = ( r , p ) denote a point in 6 dimensional phase space, and Xi = 
(ri,Pi) the particles coordinates. The micro-state is given by X = 
{xi,X 2 , ■ ■ ■ ,X]\f). The distribution function for X is denoted by fN{X,t), 
where fN{X,t)dX is the probability that the system is in a micro-state in 
the element dX at X- It evolves according to the LiouviUe equation 




dfN 

dt 



dfN „ dfN \ 

dn ^ ‘ ■ dpi ) 


= 0 . 


The microscopic phase density is 


N^{x, t ) = ^ 6{x - Xi ( t )). 
l < i<N 


The ensemble average oi is given by 

{N^ix,t)) 


where V is the systems volume in r-space. This equation defines fi(x,t) 
which is the probability that a particle will be found at x at time t, regardless 
of where aU the other particles in the system are located. Many of the 
interesting macroscopic properties of the system can be determined from /i , 
so we would like to solve for it. Since dN^ /dt = 0 we get 

dN ^ ^ dN ^ dp dN ^ 

dt ^ dt dr ^ dt dp 


/ Viif )) fN{yi , 2/2, • • 

■ , Vn ) (5) 

\< i<N 


N t 

— J dyiS{x - yi(t))/i(2/i, 0 

(6) 

yfl(x,t), 

(7) 
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or 


dN^ dN^ „ dN^ 

dt ^ dr ^ dp 

Taking the ensemble average of this equation gives 

Writing = {F^) + 8F’^ and {N^) = (iV/F)/i gives 


dt 


+ u ~ + (F^) • ^ = - {8F^ ^ 




dp 


SN^). 


dp 


If we set the right hand side of this equation, which is often referred to as the 
collision integral, to zero we get the Vlasov equation. This depends only on 
the ensemble averaged field (F^), It does not incorporate any effects due to 
the granularity which is present in any real micro-state. If we wish to include 
effects due to this granularity, we need to include the right hand side term 
which depends on the ensemble average of correlations in the fluctuations 
of the micro-states. For plasmas the simplest non-zero approximation for 
this term produces the Landau equation. A yet more detailed approxima- 
tion leads to the Balescu-Lenard equation which incorporates the dynamic 
polarization effect known as Debye shielding. 

What does 6F^ look like for systems in which the particles interact 
through an force? 

Consider a distribution of stars, all of mass m, with number density 
n(r, 0, <l>) in spherical coordinates. Let us assume for simplicity that n has 
no significant r dependence. The gravitational force on a star at r = 0 due 
to the stars in a volume element dr sin 0d9d(f> at r is 

m^Gn{0^ 4)v^dT sin 0d0d(j> r , 


where r is a unit vector in the radial direction. The force on the star due 
to all the stars in a shell of thickness dr and radius r is 


d(F^) = dr 


[/ 


n{0, <j>)r sin 6d0d(f> 


m^G. 


The term in the square brackets has no dependence on distance. So in this 
idealized case all shells of a given thickness contribute equally to the force on 
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the star at r = 0. Since there are more shells at large r than small r, distant 
interactions will dominate (F^). It is this behavior which introduces long 
wavelength collective modes to the system. 

The fluctuation term SF^ has a difierent r dependence. The contribu- 
tion to the fluctuations from a given volume element scales as the square 
root of the number of stars in that volume element, 

G^Jn{9^ (f>)r'^dr sin OdOdcf) r 
— m^G^Jn{9^ (f>) sin OdOdcf) r . 

Summing over 9 and (f) gives the contribution d(SF^) from the shell dr at 
r. Once again the factor in the square brackets is independent of distance 
so 

^)oc 

In a plasma close to equilibrium, Debye shielding adds an exponential factor 
to this behavior causing even faster fall off with distance. Xd is the 
Debye length. 

Our conclusion from this crude hand-waving argument is that fluctua- 
tions in the force are a short scale phenomena. If they are significant then 
we will need to model them accurately in our simulation. This means re- 
solving the interactions between close neighbors. We have seen however that 
PM codes are very inefficient for this type of system. On the other hand, 
if fluctuations in the force are not important, the system evolves under the 
influence of the long range ensemble average {F^), PM codes accurately 
represent forces on scales greater than a few mesh cell sizes, so we can expect 
them to do a good job for these ‘collisionless’ systems, for which they can 
be used to study collective modes in the wavelength range A < A < 

How do we determine whether force fluctuations are important in a real 
plasma or gravitational N-body system? The more particles that are in- 
volved in representing a given mass or charge distribution, the less granular 
the resulting force field will be and the less significant SF^ will become. 
In an homogeneous plasma only those particles within a distance Xd con- 
tribute to SF^ , Therefore as Nd — nX^ increases, the plasma becomes more 
collisionless. We can assume 

Nd » 1 

as a condition for coUisionless behavior. 
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For gravitational N-body systems the criterion is less clear-cut. Imagine 
a galaxy of radius R with N stars. For simplicity let us assume that within 
the galaxy the distribution of stars is uniform. For a star at the center of 
the galaxy the mean force (ie. the ensemble averaged force) is zero. Only 
the fluctuations persist. If we sum the individual binary interactions of a 
test star with all the stars in the galaxy, assuming each interaction can be 
considered independent, we can derive an expression for the rate at which 
the test star is deflected from its trajectory. Using Rutherford scattering 
theory we derive an expression for Av\ due to a typical interaction with a 
field star. Here ± denotes the plane perpendicular to the initial velocity of 
the test star. Then we sum this expression over aU possible interactions as 
the star crosses the galaxy once.^ The result is 

Avl 81n7V 
^ N " * 


So the condition we need to satisfy is that Av^jv"^ « 1, ie. 


81n7V 

N 


« 1 . 


This is a very crude and idealized estimate, but it gives us a more quanti- 
tative criterion for deciding if PM is an appropriate technique for a grav- 
itational N-body problem."* From this expression it is clear that galaxies 
(iV 10**) are coUisionless, globular clusters {N ^ 10^ — 10^) are marginally 
collisionless, while stellar clusters {N 10^) are dominated by fluctuations. 

There is a caveat to bear in mind with regard to gravitational systems. It 
is possible that some stars in the system might be unaflfected by fluctuations 
while at the same instant other stars are being dominated by fluctuations. 
This occurs because of spatial clustering which tends to develop in gravita- 
tional systems but not in plasmas. Conditions in the neighborhoods of two 
spatially separated stars can be significantly different. 

In both plasmas and gravitational N-body systems the particles inter- 
act with each other through a force, and so we might expect them to 
be similar in nature. However in the gravitational case the forces are al- 
ways attractive. In plasmas the interactions between particles can be either 
attractive or repulsive. As a result these systems have major differences. 

^This of course violates our assumption that the star is at the center of the galaxy, but 
our purpose here is to derive a crude estimate for which we consider this approximation 
acceptable. 

^Less crude criteria can be developed from the Landau equation. See chapter 8 of 
Binney and Tremaine [8]. 
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Gravitational N-body systems are subject to a collapse instability which 
does not occur for plasmas. When a density perturbation develops, the 
result is to more strongly attract surrounding material toward the denser 
region, reinforcing the original perturbation. Plasma density perturbations 
do not feed themselves in this way, 

The result is that gravitational systems generally possess very high den- 
sity contrast, with particles distributed in clusters which may be part of a 
hierarchy of clusters. Plasmas tend to be much more uniform. 

In addition plasmas are affected by a phenomenon known as Debye 
shielding. Every particle in the plasma has a tendency to attract those 
particles in its neighborhood with charge of opposite sign and repel parti- 
cles with charge of similar sign. This weakly correlated behavior has the 
effect of screening fluctuations on scales longer than the Debye length A^. 


1.7 Key features of a collisionless plasma 

Most of the pioneering studies of the properties of PM schemes were colli- 
sionless plasma calculations. So let us take a moment here to review some 
of the key features of collisionless plasmas in light of the points made in the 
previous section. 

The interaction forCe\^ r""^) is a long range force. In principle and in 
practice the system supports long wavelength modes involving coherent col- 
lective motion of many particles, 

The highest characteristic frequency associated with collective modes is 
the electron plasma frequency 




/47rne^\ I 

\ me ' 


The range of wavelengths of these coherent modes is bounded at the lower 
end by the Debye length Xa- Electron thermal motions dampen modes with 
A < Xd so strongly (Landau Damping), that we can say collective modes 
only exist for A > A^. 

Particles separated by less than A^ see each other as individuals - parti- 
cles separated by more than A^ interact with each other as participants in 
collective wave modes. 

The collision frequency z/e(= Toon) is the rate at which sub-A^ fluctuations 
scatter a particle. 


2tti>c 
— oc 

^pe 


nlnA 1 , 

/ — ^ AT 

Vn Nd 
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27TZ/r 




« 1-^ Nd » 1 


^pe 


In coUisionless systems, the A > coherent collective modes are the objects 
we wish to study. 

A collisionless system has a very large number of particles in a Debye 
sphere. A coUisional (fluctuation affected) system has relatively few. To 
properly model a coUisional system we must faithfuUy reproduce For 
a colUsionless system we can ignore snh-Xd fields. The chaUenge there is to 
make sure that while grossly under-representing the number of particles we 
maintain 

27Ti/c 


U). 


« L 


'pe 


One way to achieve this is by filtering out the sub-Ac/ scale components of 
the electric field. Since Xmin = 2A is the shortest wavelength which the grid 
will support, we can arrange this if we set A Xd- 


2 ESI - A ID Electrostatic Code 

ESI is a weU-known and documented ID electrostatic code, written by Bird- 
saU and Langdon, which is widely used as a teaching aid. For a more com- 
plete description of the code and a variety of simulation projects to which it 
can be appUed, the reader is referred to reference [2]. At this point we shaU 
present a quick outline. 

2.1 Integration of the Equations of Motion 
The particle equations of motion are 

dxi 

dvi F<i 
dt rrii 

order accurate in time for 
equations are 

Fr (8) 


ESI uses a leap-frog scheme which is second 
constant integration timestep At. The discrete 

n+l/2 n— 1/2 

U* — U* 
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Figure 5: The leap frog time integration. Note that the times at which v 
and X are knov/n are offset by At/2, 


^ n+l/2 

At ‘ 


( 9 ) 


where superscripts denote time levels. This is illustrated in figure 5. Note 
that the times at which a particles position are known are offset by At/ 2 
from the times at which its velocity is known. For an harmonic oscillator (ie. 
F (X X — xo) of frequency cjqj leap frog has no amplitude error for uoAt < 2, 
and has second order phase errors. Thus choosing At to satisfy wq A i = .3 
gives reasonable accuracy provided we do not run the integration beyond 
roughly lOOAt 


2.2 Integration of the Field Equations 
ESI solves the ID form of Poisson’s equation 




<t> = -p{x) 


and computes the electric field using 


F -JA 

dx' 


( 10 ) 


( 11 ) 
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Taking fourier transforms of these equations gives 

m = 4 ? ( 12 ) 

and 

E{k) = (13) 

For a discrete periodic system of length T, the fourier transform and inverse 
transform are defined by 

Ny-l 

f{ki) = Axj2 

fl=Q 


1 Ng-l 

fM = jE 

where ki — 27t//T« We could form the discrete transform p{ki) and then use 
equations (12) and (13) to find jE(fc;), and finally apply the inverse transform 
to get E{xg). However this mixes discrete representations p{ki) and E{k\) 
with continuous representations k^ and —ik for the operators and 

didx. ^ So instead we replace equations (10) and (11) with their finite 
difference approximations 

^ff+l ~ ‘^^9 d~ 4^g~l __ 


and 

where Ax is the mesh cell size, 
difference approximations gives 


2Ax 

Taking discrete transforms of these finite 


^{h) = 


p(fc/) 

Ki 


(14) 


and 

E(ki) ~ -iKi^{ki). (15) 


where 


and 


= ki 


sin kiAx 
kiAx 


K, = kf 


( sin \kiAx 
\kiAx 


®We win explore this further in section 4.7 
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PgiXi) 


◄ ► 

Ax 

Figure 6: The variation of charge from particle i assigned to mesh point g 
as the particle’s location Xi is altered, for NGP assignment. 




2.3 Assignment of Particle Charge to the Mesh 

We are given two choices by ESI. The lowest order and least expensive is 
Nearest Grid Point(NGP). In this case aU of a particles charge is assigned 
to the mesh cell in which the particle is located, as shown in figure 6. As a 
particle crosses the boundary between two cells, the charge assigned to both 
cells jumps dis continuously. This produces some noise in both p and E, 

The higher order alternative provided is Cloud‘-In-CeU(CIC). In this case 
each particle can be considered as a uniform charge cloud of width w. The 
default width is ty = Ax, the mesh spacing. This scheme is illustrated in 
figure 7. If the particle position Xi is located between the two mesh ceU 
centers Xg^\ and x^, the charge assigned to each of these cells is given by 


Xi 

pg = qi~ 


Ax 


and 


Pg-l 


= qi 


Xg Xi 

Ax 
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Figure 7: The variation of charge from particle i assigned to mesh point g 
as the particle’s location Xi is altered, for CIC assignment. 


2.4 Force Evaluation 

To ensure momentum conservation the same interpolation scheme is used to 
compute the force on a particle as was used to perform the assignment of the 
particles charge to the mesh. If we used NGP then the force on the particle 
is simply the force evaluated at the mesh point nearest to the particle. For 
CIC it is given by the formula 


Fi 


= 


{x^ Xg — i) 

Ax 


Eq + 


[Xg Xj^ 

Ax 



for Xg-i < Xi < Xg, There is also an energy conserving option which uses 
CIC charge assignment and NGP force evaluation. However this does not 
conserve momentum. 


3 Time Integration Schemes 

In this section we provide a brief review of the properties of a number of the 
most frequently used time integration algorithms. We have followed the for- 
mat adopted in chapter 4 of Hockney and Eastwood [1] to which the reader 
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is referred for more detail. We will consider three algorithms of diiferent 
order, the first order Euler scheme, also known as upwind differencing, the 
second order leapfrog algorithm which we have already encountered, and a 
fourth order Runge-Kutta algorithm. For the sake of brevity we will attempt 
to derive the properties of the leapfrog scheme only, while simply quoting 
the corresponding results for the two other schemes. 

Applying the Euler scheme to the particle equations of motion gives 


- uf _ 


mi 


At 


(16) 


= (17) 

The leapfrog scheme was given in equations (8) and (9). The fourth order 
Runge-Kutta scheme for a system of equations given by 


dz 

dt 




where z and f are vectors (ie. in our case z = ), is defined as 

= z" + hfei -b 2fe2 + 2^3 + ^ 4 ) + 0{At^) 

0 


with 

fei = At/(z",r) 
k2 = Atf{z- + ifci,r + iAt) 

fc3 = Atfiz- + ^k2X + l^t) 

fc4 = At/(z" + ^fc3,rd-^At). 

There are four important criteria to be considered when choosing an 
algorithm to integrate the particle equations of motion. 


• convergence 

• accuracy 

• stability 

• efficiency. 
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3.0.1 Convergence 

By convergence we mean that the numerical solution converges to the exact 
solution of the differential equation in the limits as At and Ax tend to 
zero. For linear schemes Lax has shown that consisteney and stability are 
necessary and sufficient conditions for convergence. Consistency requires 
that the difference approximation should reduce to the differential equations 
in the limit as At ^ 0. It should also preserve time reversibility. All 
three schemes reduce to the correct differential equations in the limit as 
At 0. However of the three only leapfrog satisfies the time reversibility 
requirement. 


3.0.2 Accuracy 


By accuracy we mean the truncation error associated with approximating 
derivatives with differences. If we combine the two equations defining the 
leapfrog approximation into one we get 

^n+l _ 2a.n ^ ^n-1 ^ 

At^ ~ m 

which can be compared with the exact expression 

d^x _ F 
m' 



Using Taylor series expansions for and x^ ^ about x^ gives 




X 




dx 

dt 

x^ - At— 
^ ^dt 


1 . . 2 ^^® 
1 A 2 ^^® 

+ 2 !^* W 


1 sd^x 

, ^ ~dW 

- W— 


3! dfi 


+ 0{Af) 

+ 0{At^). 


Combining these gives 

— 2a:” + a;”~^ d^x 


At2 


dt^ 


+ 0(At2). 


Thus leapfrog is second order accurate in time. 

It is relatively trivial to show that the Euler scheme is only first order. 
Showing that the Runge-Kutta scheme is fourth order is a very long and 
tedious task which wiU be left up to those readers with superhuman patience. 
It should be pointed out that order is generally but not always a reliable 
guide to the accuracy of a scheme. Each scheme has its complement of 
pathological applications which can cause it to break down. 
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3.0.3 Stability 

Do errors grow in time? If round-off error (ie the error introduced because 
the computer only stores numbers up to a certain precision) grows in time 
then the scheme is unstable. 

Consider again the leapfrog equation (18). Let be the numerical 
solution at time and be the exact solution (ie no round-off error) to 
this difference equation. We shall denote the numerical error at time by 

(19) 


Using equation (19) to replace x in equation (18) gives us an equation for 
the evolution of the error e with time, 




= — (f(X” + e") - F{X^)^ 
dF 



~ m dX X- ■ 


( 20 ) 

( 21 ) 


If we assume that we are looking at bounded oscillatory solutions of the 
form 

)n 

then by substituting this into the previous equation we get 


e" = (A)" = (e 


iujAt 


A2-2A + 1 = -i^AtfX 


( 22 ) 


where we assume also that 


This has solutions 


1 dF 


mdX 


Xn 




A± = 1 - \ 

2 ^ 


1T\ 1 




and the general solution is 


= aA!|: + 6A!!:. 

The scheme wiU be stable provided |Aj-| < 1. Figure 8 shows how A± varies 
as a function of ^lAt. When QAt < 2, A± has an imaginary part, but 
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Figure 8: The roots of equation (22) as a function of SlAt. For ^At < 2 
both roots have an imaginary component, but both have magnitude |A| = 1. 
For SlAt > 2, both roots are real and |A_| > 1. 


for Q,At > 2 both solutions are real. For Q,At < 2 we can easily show 
that |A±| “ 1. This means that not only is the leapfrog scheme stable for 
fiAt < 2, but it has the additional advantage that it suffers no amplitude 
dissipation. When OAt > 2 however |A_| > 1. Therefore to guarantee 
stability, we can calculate the largest value of \m~^dF/dX\ and then set At 
such that 


1 dF 
mdX 


^A^ < 2. 


The stability equation for the Euler scheme as given in equations (16) 
and (17) has roots 

A± — 1 i iQ,At 

which has |A±] > 1, and so is unconditionally unstable. The Runge-Kutta 
scheme is also unstable when applied to the particle equations of motion. 
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When we make the simplifying assumption that the force acting on the 
particle is proportional to the particle displacement, ie simple harmonic 
oscillation, the stability equation has solution 

A± = 1 - ± 

In this case \X±\ > 1 for aU II At except in a short interval about flAt cf 2.5. 

This analysis is more complicated for non-osciUatory solutions. There 
we need to check that grows more slowly than X'^. 

There is of course another requirement for accurate stable integration of 
the equations of motion. No particle can be allowed to move more than one 
mesh cell in distance during one timestep. 


3.0.4 Efficiency 

This is a critical consideration since whatever scheme we choose will be used 
for each particle at each timestep, a total which can be comfortably in excess 
of 10^® times. It is generally true of a lower order scheme that they 

• involve fewer intermediate time levels per timestep 

• require fewer stored intermediate values 

• require fewer floating point operations per timestep 

• have a greater stability range 

• require finer timesteps to achieve the same accuracy 

compared with a higher order scheme. The conventional wisdom is that the 
simple second order leapfrog achieves the best balance between accuracy, 
stability and efficiency. In ESI, with normalizations such that At = 1, and 
with the variable replacements 

_ At 

V V = V - — 

Ax 


X X = 

Ax 

then 

'^new — '^old “b 


^new 


— ^old “b '^old 
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where 


A(xoid) = aE(^Xoid) 

with a — qAt^ /rriAx. This implementation is remarkably efficient requiring 
just 4 fetches from memory, 2 additions, and 2 stores in memory per particle 
per timestep. 


4 Spatial Diseretization 

The mesh is involved in three separate stages of the code, 

1. assignment of the particle charge to the mesh 

2. solving the field equations 

3. interpolating mesh defined forces to the particle positions. 

All three steps introduce some error. However we should renaember that the 
only error that matters is the final combination of errors. As we shall see it 
is sometimes possible to manage these individual errors in ways which allow 
them to partially cancel each other and so produce a superior result. 

In discussing the ramifications of spatial discretization we will limit our- 
selves to the two schemes which we have seen already, NGP and CIC. To 
illustrate discretization errors we will consider a simple test problem, a pe- 
riodic ID electrostatic system with boundaries at x = ±i/2, and a uniform 
mesh with cell size A®, and we shall assume the following discretized field 
equations, 

^ 9+1 ~ ^ 4^9 "I" __ 

Ax2 - 

4’g^\ ~ 


Eg = - 


2 Ax 


Before we proceed there are a couple of function definitions which we need 
to establish. We define a cloud shape function S{x) which gives the charge 
density associated with a finite-sized particle at a: = 0. For NGP this is 




and for CIC, 


sw = ^n(x). 
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Figure 9: The weighting functions and their fourier transforms used in NGP 
and CIG schemes. The NGP weighting function Wnqp{x) is the hat function 
n(a:), and the CIC weighting function Wcici"^) is the triangle function f\{x). 


n is the hat function shown in figure 9 and defined in equation (23). The 
charge assigned to mesh point g from particle i located at Xi is computed 
from the weighting function 


W(Xi - Xg 



S{xi - 


x')dx' , 


The total charge at g is 

Np 

Pa = “ ^a) 

i=l 
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If the number density of particle centers is 


Np 

n{x) = '^S(x - Xi) 

i=i 

we can define a continuous charge density pc by analogy 


Pc- 


Q 


Note that pg = pc{xg)^ in other words pg can be obtained by sampling the 
continuous density distribution pc. 


4.1 NGP 

For NGP the weighting function is given by 


W{x) = U(x) = { 


1 1^1^ Aa;/2 

0 I ^ l> Ax/2. 

The mesh charge density is obtained by sampbng the continuous density 

n 

pg = Pc{x%^ = J-L/ 2 ^^^' ~ 


(23) 


where 


Np 

n{x) = 

i=l 

The force on particle i is given by the force evaluated at the nearest cell 
center. If Xg — Ax/2 < X{ < Xg + Ax/2 then 

F{xi) = qiE{xg) (24) 


Np-l 

= Qi W{Xi — Xg)Eg. 

3=0 


(25) 


Now, consider our simple test problem, a ID electrostatic system with 
periodic boundary conditions imposed at a: = ±L/2, and just two particles, 
a charge —q at x\, and a charge +g at X 2 , with x^ > x\. We shall label 
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Figure 10; The mutual force between a positive and negative charge as 
a function of their spatial separation, in a periodic system of length 16Aa:, 
using NGP(top frame) and CIC(bottom frame) charge assignment and force 
evaluation. 
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the center of the mesh cell containing the negative charge Xi , and the cen- 
ter of the mesh cell containing the positive charge ® 2 - Using NGP charge 
assignment, the potential at the mesh points is given by 


/3{Xg+ L/2) Xg <Xl 

4>g = \ + L/2) + q{Xg -Xi) Xi<Xg<X 2 

^ P{Xg 1/2) + q{xg - Xl ) - q{xg - X 2 ) X 2 < Xg 


where j3 = q{x\ — X 2 )/L. The force on the particle at X 2 is 


This is plotted in figure 10 as a function of the inter-particle separation 
for a system with L ^ 16Ax, Note that as we vary the particle 
separation the force jumps discontinuously when one of the particles crosses 
a mesh cell boundary. Another unfortunate property of this force is that 
it is not invariant under spatial translation. Suppose we vary (xi -|- X 2 )/ 2 , 
the mean position of the particles, while keeping their separation X 2 — 
fixed. Then the force fluctuates with period Aa?, as shown schematically 
in figure 11. This fluctuation in the inter-particle force is greatest at small 
separations. Minimizing this loss of displacement invariance is one of the 
most significant improvements we can make in designing a particle code. As 
we shall now see using a higher order scheme, such as CIC, achieves this. 


4.2 CIC 

With NGP we used one mesh point to define charge assignment and force 
interpolation, and we got a noisy result. CIC uses two mesh points. In this 
case the particle can be viewed as a uniform density charge cloud of width 
Ax. The charge assignment scheme is illustrated in figure 7. The weighting 
function is given by 


W(Xi Xg) — /\{Xi Xg) — 


1 I X^ Xg j j Ax 

0 


I ~ Xg |< 

otherwise 


(26) 

/\{x) is called the triangle function, and is shown in figure 9. We will leave 
it as an exercise to the reader to derive the electric potential on the mesh 
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Figure 11: Variation of the mutual force between a positive and negative 
charge as a function of their mean position but with fixed separation, for the 
two distinct cases when the particles are separated by more than Arc(top), 
and by less than Ax(bottom). NGP charge assignment and force evaluation 
are used. 

for the test problem considered in the previous section. An example of the 
interaction force between the two particle is plotted in the bottom frame of 
figure 10 as a function of the inter-particle separation. For X 2 — xi > 2Aic we 
recover the exact analytic answer for the ID problem.^ For X 2 — x\ < 2Aa:, 
the inter-particle force depends also on the positions of the particles with 
respect to the mesh. Figure 10 illustrates one example, where xi was chosen 
to be —QA/S.X from cell center while X 2 was varied. 

It is immediately apparent by comparing the two frames in figure 10 
that CIC gives much smoother forces than NGP. It is also obvious that 
the dependence of the inter-particle force on the mean particle position 
{x\ -h X2)/2 is much weaker than in NGP. Finally, the errors that do remain 
are more localized spatially than for NGP. 

®This occurs because in ID the exact analytic solution for the potential is piece- wise 
linear in x, CIC interpolation is capable of representing this variation exactly. This 
fortuitous match does not occur in 2 or 3D. 
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4.3 Momentum Conservation 

Both the NGP and CIC examples above conserve momentum. We can in 
fact show that momentum will be conserved provided we use centered dif- 
ferencing, and we use the same weighting function W for charge assignment 
and force interpolation. To prove that a scheme conserves momentum we 
need to demonstrate 

• that no self-forces act on the particles 

• that interacting particles impose equal but opposite forces upon each 
other. 

It is convenient at this point to develop some additional array notation. 
Let us consider a vector p such that each element of p is the charge density 
at a diflferent mesh point, ie p = (pi, . . . ,p^, — ^pNg)- Let us also construct 
similar vectors # and e from the mesh values of the electric potential and 
electric field. Differencing Poisson’s equation leads to a matrix equation 


= —p 

where A is an Ng x Ng matrix. Similarly, differencing 


produces 


E = -f 

OX 


e = -B4? 


where B is another Ng x Ng matrix. Combining these equations gives 


e = BA"V 

= Cp. 


(27) 

(28) 


Choosing the ditferencing scheme 


9x2 


^g+l ~ + 4>g-l 

Ax2 


means A and A ^ are symmetric. With 


d(f> 

dx 


^g+i ^ — 1 

2Ax 


B is anti-symmetric. Therefore C = BA ^ is also anti-symmetric. 
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4.3.1 Self-Forces 

Consider a charge qaXx. Let us assume for now that the weighting functions 
used to assign charge to the grid and to interpolate the force from the mesh 
to the particles are not necessarily the same, and we will denote them by 
and respectively. Consider a charge g at x. The force on this 
particle is 

Fix) = qY,WP{x-x,)E{xg). 

9 

The field E is given by 

^gg'pj^.Q^) 

g' 

and so therefore 

9 g' 

To calculate the self-force on the particle we consider only the contribution 
to p from the particle itself, ie. 

p{Xg.) Sp{Xg') = qW’iX - Xg>). 

The resulting self-force is 

- ^sW’ix- Xg>). 

9 9* 

Now if we assume that then the necessary condition to achieve 

f'seif — 0 is that C be anti- symmetric. We saw above that centered differ- 
encing makes C anti-symmetric. Thus we have established that the combi- 
nation of centered differencing and = W^ eliminates self- forces. 

4.3.2 Mixed NGP/CIC schemes 

What if ^ Consider a single particle with charge q located at 

X in an infinite ID system. Let us use Xm to denote the center of the cell 
containing the particle. 

First consider the combination of NGP charge assignment and CIC force 
interpolation. We can easily show that the potential mesh point g is 



32 



Using CIC for the force evaluation we get 




< 


“ x)/2Ax 


^771 ^ X ^w, H~ Ax/2 
x^ Air/2 <C ir x>jyi 


The self-force acts to drive the particle away from the cell center. This can 
produce instability. 

Now consider CIC charge assignment and NGP force interpolation. Let 
us further assume that the particle lies in grid cell m as before and is located 
between Xm and x^ + Ax/2. Using CIC charge assignment gives 


pm — 


f^ m+l 

Ax 


Pm+1 


= ( 


X — Xr 


q 

Ax 


Ax / Ax * 

Setting 4>m ~ 0 and assuming equal but opposite potential gradients at -too, 


le. 


as g 


4^g _ (f>-g 

Ax Ax 

0O5 we can easily show that 


“h Air) 


which implies that 


4^m 




-q 


2 


/X 

\ 2Ax ) 


The particle will perform simple harmonic oscillation about the nearest grid 
point with frequency 


^self 


2m Ax 


From our analysis of the leap-frog scheme we know this will be stable pro- 
vided the timestep is chosen to satisfy u>selfAt < 2. 

The general principle here is that we should always use a force interpo- 
lation scheme which is of no higher order that the scheme used for charge 
assignment. 
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4.3.3 Equal but opposite forces 

Now consider two particles, q\ at x\ and q 2 at X 2 ^ The force on particle 1 
due to particle 2 is 

Fx2 = q2qi v)- 

9 g' 

Similarly, the force on particle 2 due to particle 1 is 

p2\ = ~ ^g)Cgg'W{x^ - v)- 

9 g' 

But since C is anti-symmetric 

F 21 = -qiq2j2Yl^^i^^-^3')Cg>gWix2-Xg) (29) 

9 9 ' 

- ~Fx2^ (30) 

Note, this result again depend on C being anti-symmetric, and = 

W. 

We have highlighted three properties which we would like the charge 
assignment /force interpolation scheme to satisfy, 

• at particle separations large compared with the mesh spacing, the 
fluctuations in the inter-particle force due to displacement relative to 
the mesh should become negligible 

• as a particle moves across the mesh the force it experiences and the 
charge it assigns to the mesh should change smoothly 

• momentum should be conserved. 

Hockney and Eastwood [1] describe a hierarchy of schemes constructed on 
these principles (NGP, CIC, TSG, ...). Alternatively, schemes can be devel- 
oped by using multi-pole expansions of the charge distribution of each finite 
sized particle about the nearest grid point to the particle. To zeroth order 
this approach returns NGP. Including the dipole gives a scheme similar to 
CIC. 
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4.4 Aliasing 

The spurious fluctuations which appear as a result of the loss of displacement 
invariance, manifest themselves in fc-space as non-physical mode couplings, 
known as “aliasing”. Spatial structure on scales finer than Aa: cannot be 
represented on the mesh. Some of the power in this fine scale is erroneously 
interpreted by the mesh as belonging to longer wavelength modes which the 
grid does support. 

For a ID infinite system the density of particle centers is given by 


n{x) = 53 ~ 

i-i 


and the continuous charge density is 

Pc{x) = f Wix^ — x)n{x^)dx^ . 
/\X J — oo 


This has fourier transform 


with 


p,{k) = r dx p,{x)e-^^^ 

J — OO 
1 poo 

Pc{x)=—j_^dkp,{k)e^\ 


(31) 


By introducing a mesh we reduced our representation of p{x) from a con- 
tinuous representation Pc{^) to a sampled representation Ps{^g)^ The only 
wavelengths which can be represented on the mesh are A > 2 Ax, ie k < 
7t/Ax — kgrid/^^^ The appropriate transform for this discrete representation 
is the discrete fourier transform 


oo 

Ps{k) = Ax ^ ps{Xg)e~''^'^9 

5=~oo 


1 f^grid/'^ 

Psi^a) = ^ / dk ps{k)e^^^K (32) 

2ir J-kgridl'^ 

Equation (32) expresses Ps{^g) in terms of the transform of the sampled 
charge density, but we can also write it in terms of the transform of the 
continuous charge density (equation (31)), 

^.(x^) = p,(xg) = ^ dk P\{k)e^^^^ . (33) 
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Breaking up this integral into intervals of length kgrid, and setting k = 
k' + nkgridi equation (33) becomes 

OO 1 d/2 

M%)= E 7T I + (34) 

n=-00 J-kgridl'i 

Comparing equations (34) and (32) implies, 

OO 

n=— OO 

But inkgrid^g = 2'King and therefore 

^ThkgTri^Xg _ ^ 

so that 

OO 

Ps{k) = E /®c(^ + nkgrid). (35) 

n=— OO 

When we represent p on a discrete mesh instead of a continuum, we limit the 
independent wavelengths which the solution can contain to the wavenumber 
range —kgrid/2 < k < kgrul^, called the “principal zone” or “first BriUouin 
zone” . The transform of the charge density in our discrete representation is 
given by the sum of copies of the transform of the continuous charge density, 
each offset by a different multiple of kgru. This means of course that Ps(k) 
is periodic in k with period kgrU- The extra contributions (from jn] > 0) to 
Ps(k) inside the principal zone are called aliases. 

Note that for the infinitely long mesh there is a continuum of allowed 
wavenumbers in the principal zone. In the case of a finite sized mesh 
there would be just a discrete set of wavenumbers given by A: = 0 and 
k = 2ir/(nAx) where 1 < n < Ng. 

Aliasing is generally worse for k ~ kgridf2 than for A: ~ 0, ie the shortest 
wavelengths are most affected. Also, the smoother p is, ie the less power 
there is at short wavelengths, the less likely it is that aliasing will be a 
problem. Recall that 

Pc(®) = J n{x')W{x' — x)dx 
which implies by the convolution theorem that 

Pc(k) = ^Hk)W{k). (36) 
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◄ ► 

Principal Zone 

Figure 12: The fourier transform of the discrete sampled charge density, 
which is the sum of copies of the transform of the continuous charge density, 
each offset by a different integer multiple of kgru- 


Equation (35) implies that the narrower Pc{k) is in A:-space, the less aliasing 
will occur/ But from equation (36) we can see that we can narrow Pc{k) 
by narrowing W(k)^ or in other words by making the charge assignment 
smoother. 

For NGP we saw that Wnqp{x) = II(®), the hat function defined by 
equation (23) and shown in figure 9. The fourier transform of this is 

Txr A • fkAx\ . sin(fcAx/2) 

WNGP{k) = Ax smc(— ) = Ax 

If we use CIC then Wcic(^) = A(^)i the triangle function defined in equa~ 

^If pc(k) is band limited, in other words, a critical wavenumber kcr exists such that 
Pc{k) = 0 for |A;| > fccr, then no aliasing will occur provided kgrid > kcr- Under these 
circumstances p can be represented exactly on the mesh. When kgrid = kcr the mesh 
samples p at the Nyquist frequency. 
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tion (26)5 and also illustrated in figure 9. This has a transform 

Wcicik) = Ax 

In figure 9 we can see that Wcicik) is narrower than WNGp{k) and so less 
susceptible to aliasing. 

Aliasing wiU be produced by any mechanism which introduces wave- 
lengths shorter than 2 Ax. Low order charge assignment schemes do this. 
Setting the particle size w smaller than the mesh size will also do this. We 
get strong aliasing if we set w < Ax/ 10. This limits the dynamic range 
of wavelengths which we can include in our model to A < lONgW. This 
lower limit on w can be relaxed somewhat by using higher order assignment 
schemes. 

Setting the Debye length much less than Ax wiU also introduce wave- 
lengths shorter than 2 Ax. In a mono- energetic beam for example, T = 0 
and so Xd = 0. If we use ESI to model this system we find that the beam 
begins to spread in v space. Aliasing is feeding energy from A < Ax modes 
into modes supported by the mesh, heating the beam and thereby raising 
both T and Xd^ This thermal instability persists until A^^ grows sufficiently 
that the condition A^ << Ax is no longer satisfied. 


4.5 The Field Solver 


Formally, we can write 

00 

4 ’s{_^g) — y ^ Gs{Xg XgiyPs^Xg') 

g'=—Qo 


or in fc-space, 

Mk) = Gsik)ps{k). (37) 

The function Gs is a discrete representation of the Greens function for Pois- 
son’s equation. The transform of the exact (continuous) Greens function for 
= —p is 1/k^. If we use Gg(k) = l/k'^ then 


GA^g") - 2^ J_ 


1 r^gridl*^ 


^grid/*^ 


ikx„ 

'fc2 


where Xgii = Xg — Xgi = (g — g')Ax. Therefore 


Gs(x, 


= f / 

7T Jo 


kgridl'^ SVD.kx„ii 
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For g' = g this gives Gs = 0. When g' ^ g we have Gs{xgn) = —Gs{—Xgn) ^ 
0, ie in a;-space l/k"^ appears as a very non-local operator. If instead we use 
the finite difference approximation for 

4^g+\ ~ "t" 4^g—l ~ ~Pg^^ 

which transforms to 

- 2 -h == -Ax^p,{k) 

where 

oo 

^s{k) = ^x <f>s{xg)e~^’^^^, 

g=-oo 

giving 

= ft{‘)/K(^). (39) 

DifFerencing exaggerates the amplitude of higher k modes. 


4.6 Force Evaluation 


Recall that 



and 

E{k) = -ik^{k). 

Again we can choose to use these exact operators, or finite difference ap- 
proximations to them. If we replace dcfpjdx with 


2Ax 

we replace the exact operator, — tfc, in fourier space, with —ik sin (feAx)/fcAa:, 

E{k) = --ik sinc(AiAx) ^{k) (40) 

— ~it^{kAx) <^(/?). (41) 


In this case differencing acts to dampen high k modes. 
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Figure 13: The function k is introduced into the evaluation of the electric 
field when we use the centered second order finite difference approxima- 
tion. It acts to dampen the higher k modes. Similarly /K appears in the 
potential solver but this exaggerates the higher k modes. 


4.7 Optimal Schemes 

In the last few sections we have used fourier analysis to study why spatial 
discretization errors occur and how they may affect the overall solution. We 
can use this type of analysis to develop improved schemes in which the errors 
introduced at each stage can partially cancel each other. 

All four stages in our code which contribute to the evaluation of the force 
on the particles, ie charge assignment to the mesh, solving Poisson’s equation 
to obtain the potential, differencing the potential to derive the electric field 
and force values at the mesh points, and finally interpolating force values at 
the particle positions from the mesh values, can be expressed as convolutions 
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in x-space and so as products in fc-space. For example, rewriting 


E„ 


4>g+l - <l>g-l 

2Ax 


as 


9' 

(42) 

^D{Xg-Xg^)(t>g> 

(43) 


3' 

where the operator D is defined as 

D{Xg - Xg,) = ~{,6g^X-g' ~ 6 g .^.g,) J 2 l^x'^ . 

The fourier transform of equation (43) is 

E{k) = -D{k)^{k). 


Similarly interpolation of force to particle positions is given by 

3' 


which has transform 


Hk) = -^Wik)Eik). (44) 

Combining equations (43), (44), (36) and (37), gives 

nk) = -■^wik)Dik)G{k)^mmk) 

and so 

dkw{k)b{k)G{k)w{k)h{ky'’^- 

This equation enables us to combine the separate numerical steps in a way 
which produces the most ‘‘accurate” expression for F{x). 
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5 Energy Conservation and Collision Times 

The schemes we have considered so far are momentum conserving, but do 
not conserve energy, although the misconservation can be kept to acceptably 
small levels. Energy conserving schemes do exist, but they fail to conserve 
momentum. The crucial difference in energy conserving schemes is that VW 
or V when needed, are evaluated analytically rather than numerically. We 
should point out that the term “energy conserving” is a little misleading- 
leading in this context, since these schemes only conserve energy in the 
limit as At ^ 0. In other words, introducing time discretization breaks this 
conservation. 

In momentum conserving schemes the misconservation of energy is due 
to aliasing. Likewise in energy conserving schemes the misconservation of 
momentum is also due to abasing. If we eliminate aliasing by using a band 
limited cloud shape function S{x)^ we can conserve both energy and mo- 
mentum. However this is an expensive option because it generally requires 
a very high order weighting function W(x) which is highly non-local, and 
for this reason is almost never used. 

We have seen that abasing is less of a problem for schemes which use 
higher order charge assignment. Therefore, for the momentum conserving 
schemes which we have considered, we would expect the CIC scheme to 
conserve energy better than the NGP scheme. 

5.1 Heating Time 

The principal symptom of energy misconservation is particle heating. Hock- 
ney [6] made a systematic study of heating and coUision times for a 2D 
electrostatic plasma. Let hi(t) be the deviation of the kinetic energy of the 
particle from its initial value. The average over all particles is 

Wf)) = 

We define a heating time th such that 

('*(>■«)) = \kT. 

How do we determine (/i(r//)) ? Assume that the errors in our model con- 
tribute to a stochastic error field 8E. For simplicity we shall assume SE is 
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constant in magnitude but varies randomly in direction. For one timestep 
it introduces a momentum change 

mSv = qSEAt 

for each particle. Each particle describes a random walk in u-space from its 
initial velocity vq . Writing Av — v — vq, after n timesteps 

(Av) — 0 

(|A„P) = 

which implies 


\m{\vQ + Avp) - ^m(|u^|) 

(45) 

]^m{\Avl\) 

(46) 


(47) 


Stochastic heating increases linearly with the number of timesteps. 

Hockney has shown that the heating timescale is a complicated function 
of both AxjXd and The heating rate increases as either Ax/Xd or 

lOpeAt increases. CIC has a slower heating rate than NGP, by a factor of 
roughly 20. 

5 . 2 Collision Times 

The effective collision frequency Pc was determined in the 2D model by 
measuring the deflection of particles from their original direction. 

^ P i=\ 

The collision time tq is defined by 

= f 

Hockney [6] found the relationship 

^ - n{Xl + w^). 

^pe 
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Recall that Tpe is the shortest timescale associated with collective modes in 
the system. As the particle size ty 0 the coDisionality of the system is 
determined by the number of particles in the Debye circle (nA^), The finite 
size of particles helps to increase the collision time. Note that since w is the 
same for both NGP and CIC, both schemes have the same coUisionality. 

6 Higher Dimensions 

The basic steps are the same. However everything is more complicated 
to program and more costly to run. Anisotropies appear due to the use 
of square or cubic particle shapes, and due to directional dependencies in 
truncation errors of finite difference approximations. Waves propagate with 
differing ease in directions aligned with or between axes. These anisotropies 
can be reduced by using smoother cloud shape and assignment functions. 
Visualization of results is considerably more difficult, particularly in 3D. 


7 PIC for Compressible Fluid Flow 

PIC was originally invented by Harlow for this application. His motivation 
was to develop a tool to study highly distorted or sheared flows, or strongly 
shocked flows involving material interfaces and contact discontinuities in 2 
or 3D. Here the particles represent fluid elements. In time the approach was 
dropped in favor of improved fluid codes because it was 

• too noisy 

• had high numerical viscosity (momentum diffusion) 

• suffered from large heat conduction (energy diffusion). 

It has been revived recently in codes like FLIP which are low dissipation 
PIC codes. 

In its original incarnation, fluid PIC was a partially lagrangian method 
(mass was the only lagrangian variable). Modern fluid PIC is a fuUy la- 
grangian method (ie. mass, momentum and energy are all lagrangian vari- 
ables). 

For old style fluid PIC a typical timestep would be as follows: 

1. load Np particles, each with mass velocity and internal energy 
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2. set up a mesh. 


3. construct mesh point values of the fluid density pg^ velocity XJ g and 
internal energy Eg by assigning the equivalent particle quantities to 
the mesh using eg. 

Pa = ^Yl'rniW{xi - Xg). 

9 p 

4. solve the navier stokes equations on the mesh using finite differences. 

5. construct a velocity and energy for each particle using eg. 

Ui = J2^gW{Xg-Xi). 

9 


6. move the particles by solving 



7. begin cycle again at 1. 

Mass density information goes just one way, from the particles to the 
grid. Mass is alagrangian variable so mass diffusion is eliminated. However 
particle momentum and energy information is replaced every timestep by 
new values interpolated from the grid solution of the fluid equations. It 
is this transfer backwards and forwards which made the old style PIC so 
diflPusive. 

Modern fluid PIC, like FLIP, makes momentum and energy lagrangian 
variables also. This is achieved by solving fluid equations on a lagrangian 
grid, (ie. a grid which moves with the local fluid velocity), then updat- 
ing rather than replacing the particle velocities and energies using the grid 
solution. 

FLIP(Fluid Implicit Particle, Brackbill et al) is an implicit lagrangian 
code with adaptive re-zoning. It is less accurate than finite difference meth- 
ods where they apply. It is also more expensive. However it still retains its 
advantage over finite difference methods where contact discontinuities and 
material interfaces are important. 

EPIC(Ephemeral PIC, Eastwood) is an alternative low dissipation fluid 
PIC approach. It is finite element based, using an anti- diffusion step to 
remove momentum and energy diffusion. 
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8 Application to Incompressible Fluid Flow - the 
Vortex Method 


By definition, for incompressible flow 

Vp = 0 

and by implication from the mass continuity equation 

V • ti = 0 . 


We can therefore write u as 


•u = V X V’ 


where is the stream function. The momentum equation is 


du 


P 


dt 


Vp 


and the vorticity is defined as 


oj = V X u 


We can show that vorticity is a conserved property of a fluid element in 
incompressible flow, ie. 


d 1 __ 

-7-V X u = — V X Vp 

dt p 


and since V X V/ = 0 for any function / we have 


du: 

dt 


0 


In 2D, Uz = 0 and d/dz ^ so 


w = V X u = 



dj>z 
dx ’ 


0) 


So for an element of fluid to which we assign a vorticity the equations 
of motion are 


dx 

dt 


= u 
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u = 'V X {i>zk) 

= -Uz 

ie. the same equation structure as for a 2D electrostatic plasma. 3D vortex 
methods involve tracking vortex tubes. This is not a simple generalization 
of the 2D method. A comprehensive review of the vortex method is given 
by Leonard [5]. 

9 Parallel PIC 

To understand how to optimize a PIC code for a particular architecture we 
need to understand how the processors and data memory are configured. 
We can think of our parallel computer as a system of data memory which 
feeds data to and receives data from a set of arithmetic processing units. 
This memory system can include 

• registers on processor 

• cache on processor 

• distributed RAM. 

Each of these components have significantly different access times. The 
trick in optimizing the code is to tailor the algorithm and data layout to the 
machine in such a way as to keep the largest possible number of processors 
working effectively while reducing their access times to the data which they 
need. 

In the most abstract sense, we have an algorithm and a data structure 
to map to the architecture. The four basic steps in a PIC algorithm are 

1. assign particle charge(mass) to the mesh 

2. solve for the force field on the mesh 

3. interpolate force from the mesh to the particle positions 

4. push the particles. 

In combination these four steps involve computation and communication 
between two different data structures. The field data has the qualities of 
an ordered array in the sense that each element has specific neighbors. The 
particle data has the qualities of a randomly ordered vector, in which element 
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i refers to particle % and no element has any special relationship to its 
neighbors in the vector. 

Steps 2 and 4 are parallelizable in rather obvious ways, since they involve 
only simple and completely predictable data dependencies, and do not couple 
the two data structures. Steps 1 and 3 however do couple the two data 
structures, with complicated and unpredictable data dependencies which 
evolve during the simulation. It is these steps which invariably dominate 
the execution times of paraliel PIC codes. 

There is one further observation to make before we discuss implementa- 
tion specifics. On a serial machine our code will execute its computational 
workload in a time which is independent of any correlations in the spatial 
locations of the particles. This is not true on parallel machines. Particle 
clustering can create communication and/or computational hot-spots which 
impair performance. For example, an algorithm which works very efficiently 
for an homogeneous plasma application may be very inefficient for a highly 
clustered gravitational N-body problem. This can be an important factor 
in choosing an algorithm. 

On a specific parallel machine, many factors will influence a codes per- 
formance. Machine architecture is unquestionably the most important. It 
would be appealing to present the best techniques for each of the broad ar- 
chitecture classes, such as SIMD (Single Instruction Multiple Data), MIMD 
(Multiple Instruction Multiple Data) with distributed memory and MIMD 
with shared memory. However such a clean presentation would be mislead- 
ing. Even within these broad classes there are more minor architectural 
diflFerences which can cause us to favor different algorithms. Differences in 
compilers, both in terms of functionality and maturity,^ add to the problem. 

So we will not attempt such a general description of preferred parallel 
algorithms. Instead we wiU examine how PIC codes have been parallelized 
on three specific machines. These techniques should serve as suggestions 
rather than rules, when we approach other machines. The three paralleliza- 
tion tasks we will consider are vectorizing the code for a powerful vector 
machine such as the Cray YMP series, implementing it on the MasPar’s 
SIMD architecture, and on a distributed memory MIMD machine, Intel 
Touchstone Delta. 


®Many of the paraHei machines and parallel compilers are in the earliest stages of their 
development. 
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9.1 Vect orization 


Three of the four basic steps of a PIC code (steps 2,3 and 4 above) are 
almost trivially vectorizable. 

The particles are pushed by looping over the particles in sequence and 
pushing each in turn. Since each particle push is independent these loops 
vectorize. Highly efficient vectorized field solvers (eg fft’s or multigrid) exist 
in various libraries(eg IMSL,NAG,ELLPACK) and we need not consider this 
any further. The force interpolation is a gather operation. Again we can 
loop over the list of particles, fetching the force values for the cells in the 
neighborhood of the particle. There are no data dependencies which will 
inhibit vect orization. 

The only PIC code step which is not easily vectorizable is the charge 
deposition step. In a serial implementation we would loop over the list of 
particles in turn, determining where each particle is located in the mesh and 
then distributing contributions to the charge density of the mesh cells in the 
immediate vicinity of the particle. This loop will not vectorize automatically 
because it is possible that two particles might try to add charge to the same 
mesh cell at the same time. We can solve this problem in more than one way. 
The choice of solution depends on how often particles in the vector pipeline 
will try to add charge to the same mesh cell. If this is a regular occurrence 
then the best solution is to pre-sort the particles. We will illustrate one 
way to do this using an algorithm devised by Horowitz [7]. However if it is 
a rare occurrence then we can test for when this happens and only inhibit 
vectorization when those particles are in the vector pipeline. 

9. 1.1 Pre-sorting 

For simplicity consider a ID model using NGP weighting. Assume there 
are rig particles in mesh cell number g. In this approach these particles are 
numbered from 1 to rig. This is repeated for aU the cells in the mesh. From 
these lists we form a number of particle groups. Group one contains all the 
particles numbered 1, group two aU the particles numbered 2, and so on. 
Now loop over particle group 1. Each particle deposits its charge into the 
mesh cell to which it belongs. Since no two particles in group 1 belong in the 
same mesh cell we know that two particles can never be writing to the same 
array element at the same time, and we can force the loop over particles in 
the group to vectorize. We repeat this process in turn for all the particle 
groups. The only part of this algorithm which does not vectorize is the 
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original grouping of the particles. It is easily extended to higher dimensions 
and to cope with higher order charge assignment schemes such as CIC. 

Additional memory space is required to store the particle lists (and ex- 
tra charge density arrays in the case of higher order charge assignment). 
Horowitz discusses how the algorithm can be tuned to trade-off between 
CPU performance and memory usage as resources might dictate. 

9,1.2 Dependency Testing 

Again, let us consider the ID model with NGP weighting. The charge 
deposition would be achieved by the Fortran 77 loop 

do i=l,npart 

rho(index(i) ) = rho(index(i)) + q(i) 
enddo 

where index (i) is the mesh cell to which particle i with charge q(i) be- 
longs, and rho(j) is the charge in mesh cell j. This scatter- with-add loop 
has potential vector dependencies, since index is not known until run time 
and changes during the simulation. 

We break this loop into sections of length nblock. 

do il=l ,npart-nblock+l , nblock 
do i=il,il+nblock 

rho(index(i)) = rho(index(i)) + q(i) 
enddo 

enddo 

Now consider one of these short inner blocks. We will test this inner loop 
to see if any vector dependencies actually occur within it. First we set up 
a temporary integer array, itemp, which has as many elements as there are 
cells in the mesh. We also set up two temporary integer arrays la and ib of 
length nblock, with ia(i)=i. Now we use the appropriate nblock elements 
of index to scatter ia into itemp, forcing vectorization and accepting any 
overwrites which might occur. 

cdir$ivdep 

do i=il,il+nblock 

it emp( index (i)) - ia(i) 

enddo 
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Then we try to reverse the scatter operation, by gathering elements of itemp 
into ib, under the influence of index. 

do i=i 1 , i 1+nblock 
ib ( i) -it emp ( index ( i) ) 
enddo 

If no overwrites occured during the scatter operation, then the gather step 
exactly reverses the scatter and so ia and ib should be identical. In that 
case, for this particular block, we can safely force the inner loop to vectorize 
using a compiler directive. However if ia and ib differ anywhere, we have 
detected a vector dependency and the inner loop must execute in serial 
order. 

Based on the success or failure of this test for each block, we can branch 
to a copy of the inner loop either with or without a preceeding compiler 
directive to ignore vector dependencies. 

This scheme enables us to vectorize the charge deposition task over vector 
lengths nblock. Of course the test involves an additional overhead of a 
vectorized scatter and a gather step, so we would only consider it if we 
expected the test to find no vector dependencies much of the time. For a 
random spatial distribution of particles the chance that two or more particles 
in the same block will try to add charge to the same cell is determined by 
the ratio of nblock to the number of ceUs in the mesh. For performance 
reasons we would like to keep ia and ib in vector registers which means 
that nblock will be set to the vector pipeline length. On the Cray C90 that 
means nblock = 128. nblock will be much smaller than the number of mesh 
points for all but the smallest ID meshes. As a consequence, the benefit of 
the improved vectorization should far outweigh the overhead associated with 
the test. ^ 

No changes are required to apply this scheme in 2 or 3D, and minor 
modifications can accomodate higher order charge assignment schemes. 

9-2 SIMD Implementation 

The MasPar MP-2 has a SIMD architecture with up to 16384 (128x128) pro- 
cessors. The nominal peak performance of a 128 x 128 machine is 6.2Gflops. 
Each processor has 64Kb of dedicated data memory. The processors are ar- 
ranged in a 2D array with dimensions which are integer increments of 32, ie 

^ Cray’s cf77 compilation system automatically implements this solution. 
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32, 64, 96 or 128. Straightline connections, known collectively as the X-net, 
exist between processors in the north, south, east, west, north-east, south- 
east, south-west and north-west directions. At the edges of the processor 
array the X-net wraps around so that the array has the same topology as 
the surface of a torus. Inter-processor communications can be achieved in 
one of two ways. The global router can be used for more complex patterns 
or for communication between widely spaced processors, while for regular 
patterns over short distances the X-net communications are much more ef- 
ficient. The MasPar series broadens the definition of SIMD in at least one 
important way. It enables indirect addressing within a processor memory. 

On a distributed memory machine, such as the MasPar, data layout 
across processor memory is an integral part of algorithm design. For PIC 
codes, once we have chosen the layout of the field arrays and particle data 
we have essentially set the computational and communication workload for 
each processor during each of the steps of the code. The challenge on the 
MasPar is to spread the computation and communication workload as evenly 
as possible, while minimizing the amount of global router communication 
required.^^ 

The field arrays will be laid out so as to optimize the field solver routine. 
We do not need to consider the fine details of this layout, which will vary 
depending on the exact size of the physical mesh and the size of the processor 
array. All we really need to recognize is that any acceptable layout will 
establish a mapping between physical mesh points and the processor array 
so that it includes most if not aU the processors, and that nearest neighbor 
mesh cells will map to processors which are no further apart than nearest 
neighbors. For example if we have a 3D mesh of size 128 X 128 X 128 and 
a processor array of size N^roc — 128 X 128, we could map cell (i,j,k) into 
processor 

The major design question which faces us is how to distribute the particle 
data. There are some obvious choices which focus either on computational 
load balance or on efficient interprocessor communication [9]. 

9.2.1 Uniform Load Balance - with communication hotspots 

The first option is to parcel the particles out evenly amongst the proces- 
sors, paying no attention to their physical locations. This achieves the best 

A plural floating point multiply takes 40 clocks on the MP-2, an X-net operation send- 
ing a real number a distance of 1 processor takes 41 clocks, and a random communication 
pattern using the global router, with all processors participating takes 5000 clocks. 
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computational load balance during the particle push and during the purely 
computational parts of the charge deposition and force interpolation tasks. 
It also makes memory management easier, since we know exactly how much 
memory we will need in every processor. 

However it makes very heavy use of the global router for interprocessor 
communication. Any given particle can potentiaDy seek to deposit charge 
on any of the processors in the processor array. For example if we use CIC 
for a 3D model, each particle has 8 charge contributions to distribute to a 
2 X 2 X 2 block of elements somewhere in the charge array. We can pack these 
8 components into a message which is then sent by the global router to one 
of the processors storing the 2x2x2 block, which then distributes them, 
as required among its neighboring processors using the X-net. Similarily 
during the force interpolation the particle needs to interpolate between the 
24 field components associated with the same 2x2x2 block (ie 8 components 
for each of the x^ y and ^ directions). It is actually more efficient to pack 
the particles coordinates into a message, send them to a processor in the 
2x2x2 block, collect the 24 components there using X-net, compute the 
interpolated field values, pack them into a reply, and send the reply back to 
the originating processor using the router. 

This scheme is slow because of its extensive use of the global router, 
and it scales poorly in situations where clustering occurs. Communication 
hotspots occur when a large number of messages are being sent to the same 
processor at the same time. The processors can only process one router 
message at a time. If particles are actually located in cells which map 
into processor p, then processor p wiU need to receive messages during 
that timesteps charge deposition. This algorithm therefore wiU scale as 
^maa; ? maximum value of across the processor array. 

9.2.2 Uniform Load Balance - without communication hotspots 

We can improve the scaling of the charge deposition by using a combination 
of a sort and a segmented vector scan-add [3]. This approach is easiest to 
explain for NGP charge assignment in the simple case where we have Np 
particles and an equal number of processors. 

Before we start we associate a unique id number IDij = i + {j — l)Nx -|- 
(k — l)Ns^Ny with, each mesh cell (i, j, fc), where No; and Ny are the x and 
y dimensions of the mesh. As before we distribute the particles uniformly 
amongst the processors, with, in this example, one particle per processor. 
We think of the particle data and the mesh as long vectors. We set up 
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another integer vector containing the particle cell ids. This vector is sorted in 
order of increasing cell id and the permutation required to sort it is recorded. 
Now we apply the same permutation to the remaining particle data vectors. 
This permutation rearranges the one to one association between particles 
and processors. Because this is a permute operation it has no communication 
hotspots. It is performed using the global router. When this step is complete 
the particle vector is composed of a sequence of blocks of varying lengths. 
Each block is a contiguous list of particles whose spatial locations map to the 
same mesh cell, and each block is stored in a contiguous block of processors 
with one particle per processor. We can now use a segmented vector scan- 
add operation to sum the charge in each block. This can be written using 
X-net calls, and scales as In One possible choice of sorting algorithm 

is a split-radix sort which scales as In JVp. 

The force interpolation can also be handled using a scan function to 
achieve a scaling with The first particle in each block fetches the 

components of the field from the mesh cell corresponding to the particle’s 
cell id. Since no processor receives more than one request there are no 
communication hotspots during this fetch. Then we use a segmented scan- 
copy to copy these values to every other particle in that particle block. 

This scheme is easily extended to accomodate higher order charge de- 
position algorithms and cases where the number of particles and processors 
differ. It is slow because the sort operation is slow, and will not compete 
with the other schemes outlined here for simulations without severe particle 
clustering. However in cases with severe clustering it may be the only viable 
choice because it is free of communication hotspots. 

9,2,3 A Particle Migration Strategy 

The schemes we have outlined so far all make heavy use of the global router. 
We can avoid the router completely if we distribute the particles amongst 
the processors according to the same mapping used for the field arrays. 

If a particle lies in ceU we store it on the processor to which we 

mapped ceU During the charge deposition, no particle wiU need 

to send charge any further than to a neighboring processor, and during 
the force interpolation the mesh field values which the particle needs are 
either on processor or stored by a neighbor. This enables us to use X-net 
communications exclusively. To maintain this locality we are required to 
migrate particle information from processor to processor as the particles 
move between mesh ceUs. Since our timestep constraint limits the distance 
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any particle can travel during that timestep to less than one mesh cell width, 
the migration can be achieved efficiently using the X-net. 

There are of course drawbacks associated with this scheme. The ad- 
ditional code needed to perform the particle data migration makes the al- 
gorithm more difficult to program and debug. It also suffers from load 
imbalance as particle clustering develops. In this case both the communica- 
tion and computation costs scale as Processor memory management 

is tricky. We have to allocate enough memory that the most heavily popu- 
lated processor does not run out of memory. However this means that a lot 
of memory space in other processors will be allocated and never used. 

This scheme has proven to be significantly faster than the others we have 
described when applied to relatively uniform spatial particle distributions. 
This is a testament to the relative efficiency of X-net communications when 
compared to global router communications. 

One possible solution to its memory management weakness is to supple- 
ment this scheme with a backup routine similar to that used in the uniformly 
load balanced technique above. Two distinct particle populations are iden- 
tified, those which have been migrated successfully(population I) and those 
which have not(population II). We use the migration strategy wherever pos- 
sible. Any population I particle which tried to migrate to a processor whose 
memory was already full is left where it is and relabelled as population II. 
After we deposit the population I charge to the mesh, we use the global 
router to deposit charge from any population II particles. Similarily, when 
we have completed force interpolation for the population I particles we use 
the router approach to find the forces for any population II particles. At 
regular intervals (ie every 10 timesteps perhaps), we test to see if the pop- 
ulation II particles might now be placed into the correct processors and so 
transferred back to population I. This hybrid scheme enables us to minimize 
memory wastage. 

9.3 MIMD Implementation 

MIMD systems present us with a more coarse grained parallelism. A MIMD 
machine will have somewhere in the range of 10 to 2000 microprocessors, 
each capable of running their own instruction stream. In principle, this intro- 
duces the possibility of using control decomposition(ie farming out separate 
tasks to different processors) as well as data decomposition. In practice the 
mandatory time ordering of the separate tasks in a PIC code leaves too little 
flexibility to make much use of control decomposition. However the separate 
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instruction streams do enable SPMD (Single Program Multiple Data) style 
programming, and as we shall see this can prove useful in optimizing load 
balance. 

As with SIMD, our design goals are to keep as many of the processors 
as we can busy doing productive work. In choosing a data decomposition 
strategy we must bear in mind that we have many fewer processors than 
in the SIMD machine, that these processors .are considerably more powerful 
computationally than their SIMD counterparts, and come with larger local 
memory banks. As a result, the balance between computation and interpro- 
cessor communication implied by our data decomposition must be struck 
somewhat differently than in our SIMD approach. It may also be necessary 
to dynamically reconfigure the data decomposition as the solution evolves, 
since the load imbalance penalty for a poorly fitting decomposition becomes 
more severe when we are working with fewer processors. The overhead for 
this dynamic load balancing must be factored into our analysis. 

We will discuss two specific MIMD PIC implementations. The first uses 
separate domain decompositions for particle related operations and for the 
field solver, at the cost of the extra communication required to share data 
between the two. The second approach uses only one decomposition. Both 
can perform dynamic load balancing. 

9.3.1 A Dual Decomposition 

The first example we have chosen to study is the ‘General Concurrent PIC 
code’ developed by Liewer et al [10] [11]. They have run this code on a 
number of machines including the Intel Touchstone Delta. 

The Delta features 576 i860 microprocessors, each placed at a node of 
a regular 2D communication grid. Each processor has 16Mbytes of RAM, 
and in principle can achieve a peak of SOMflops (single precision), although 
sustained performance is typically < 10% of peak. 

Interprocessor communication is achieved by exchanging packaged mes- 
sages. This message passing can be either synchronized amongst the proces- 
sors or asynchronous. It is not significantly more expensive for a processor 
to communicate with a distant processor than with a near neighbor. There 
is a significant setup overhead associated with each message sent, so it is 
better to send a few long messages than a lot of short ones. The cost of a 
message increases with the length of the message, so it is advantageous to 
eliminate any unnecessary communication. 

For simplicity, we wiU assume a 2D model covering a rectangular physical 
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domain. The algorithm uses two separate data decompositions, the first to 
ensure that the particle push, charge deposition and force interpolation tasks 
are efFectively load balanced, and the second to ensure that the field solve 
is load balanced. 

First the physical domain is divided into sub-domains^ with one sub- 
domain assigned to each processor, and with roughly equal numbers of par- 
ticles in each sub-domain. For non-uniform particle distributions these sub- 
domains will not have equal areas and will contain different numbers of 
mesh points. Each processor is responsible for storing the data describing 
the particles in its sub-domain and for integrating their equations of motion. 
When a particle moves from one sub-domain to another the information de- 
scribing the particle must be migrated to the appropriate processor. The 
processors also store the values of the electric field and charge density at the 
mesh points in their sub-domain, including any guard cells immediately out- 
side the sub-domain boundaries which may be required. As the simulation 
evolves and particles move between sub-domains the sub-domain boundaries 
are adjusted at regular intervals to maintain roughly equal numbers of par- 
ticles in each sub-domain. This guarantees that the particle push will be 
very evenly load balanced. 

Because we are almost certain to have unequal numbers of mesh points 
in each of these ^primary’ sub-domains, the field solver will not be load 
balanced. Therefore a secondary domain decomposition is established to 
suit the requirements of the field solver. Exactly what this decomposition is 
wiU depend on the technique used to solve Poisson’s equation, but it is most 
likely to divide the mesh cells equally amongst the processors, in contiguous 
blocks. These ‘secondary’ sub-domains do not change during the simulation. 

At the start of a timestep the particles deposit their charge to the mesh 
cells in their primary sub-domain. Because this information is needed by 
the field solver, and because the primary and secondary sub-domains do not 
necessarily coincide, this requires some inter-processor communication. The 
data which processor A must send to processor B is packaged together into a 
send buffer at A and then sent to B . When it arrives it is unpacked and stored 
appropriately. The field solver executes and the electric field is determined 
for every mesh point in the secondary sub-domains. The communication 
pattern is then reversed and the field values for the mesh points in the 
primary sub-domains are updated. At this point the forces on the particles 
can be evaluated by interpolation from the mesh using local data only. Then 
the particles are pushed. The final step is to transfer particle information 
between processors for any particles which have moved to a different primary 
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stib- domain. Again this is done by packaging the information into longer 
messages in order to amortize the message start-np costs. 

The selection of sub-domains is chosen so as to minimize inter-processor 
communication. There is no unique preferred solution to this problem. It 
is model dependent. Generahy speaking, increasing the ratio of sub-domain 
area to boundary length (or volume to surface area in 3D) minimizes the 
percentage of particles which migrate between sub-domains. However if this 
reduces the overlap between primary and secondary sub-domains there will 
be more data to be communicated before and after the field solver is called. 
In some simulations particles tend to move in a preferred direction and then 
it pays to use long thin sub-domains aligned along this direction. 

The details of the dynamic adjustment of the primary sub-domain bound- 
aries obviously depend on how we choose to configure the sub-domains. Be- 
cause there is an overhead associated with this adjustment, it should only be 
done when the load imbalance has exceeded some threshold value for which 
the resulting gain in performance outweighs the overhead. Since we wish to 
equalize the number of particles in each domain we need to know how many 
particles are in each sub-domain. This information can be accumulated with 
almost no extra effort during the charge deposition task, by also calculating 
the particle number densities at the mesh points. Then each processor sums 
the number densities over all the mesh cells in its sub-domain. 

9.3.2 A Single Decomposition 

In this case a single domain decomposition is used for both particles and 
fields [12]. Since we do not keep a second copy of the field information this 
requires less memory than the dual decomposition strategy. Inter-processor 
communication is required to migrate particle information when particles 
change sub-domains, to exchange guard cell field and charge density values, 
and in the exchange of any field information required within the field solver. 

The load on processor i is assumed to be of the form 

Li = ANi + BMi 

where Ni is the number of particles and Mi is the number of mesh points 
in sub-domain L A and B are constants which reflect the relative com- 
putational and communication costs of particle related and mesh related 
operations respectively. The objective is to adjust the sub-domain bound- 
aries, and therefore Ni and Mi so that Li is the same for all sub-domains. It 
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is possible therefore that one processor might spend more of its time on par- 
ticles and less on mesh points than another processor, and yet both would 
complete the timestep at the same time. This benefit cannot be realised 
for electrostatic codes because the global character of Poisson’s equation 
requires that the processors be synchronized both at the start and finish of 
the field solver calculation. However electromagnetic codes can be built that 
solve just local finite difference approximations to Maxwell’s equations 
These require less stringent processor synchronization. For example, at the 
beginning of a timestep processors could accumulate current density (cur- 
rent density not charge density is required), update interior field values and 
push interior particles without requiring any information from neighboring 
sub-domains. Since this accounts for most of the computational effort a 
combined field/particle load balance is effectively possible. The remaining 
steps, exchanging guard cell information, updating sub-domain boundary 
field values and particles, and migrating particles would require some syn- 
chronization between processors. 


^^ Electromagnetic models can be set up so that the solutions to cV x E = —dBf di and 
cV X B ~ BBjdi -f J satisfy both Poisson’s equation and V ■ B = 0, provided both are 
satisfied by the inital fields [13]. 
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